fix(distillation): reverse-KL server path NaN on variable completion length #2
When ``use_teacher_server=True`` with ``beta > 0`` and ``bs * grad_accum > 1``,
the reverse-KL server path leaked NaN into the backward pass whenever
per-sample completion lengths differed within a batch.
Root cause
----------
``_get_teacher_token_logprobs_from_server`` fills the rectangular (B, T)
output tensor with the TRL house sentinel ``float("-inf")`` at intra-batch
padding positions (the tail of shorter samples). The forward-KL server path
(``_compute_server_forward_kl_loss``) neutralises this via
``torch.where(teacher > -inf, ..., -inf)`` plus a support mask threaded
through ``_add_tail_bucket``; the reverse-KL server path
(``_compute_server_sparse_top_1_divergence_loss``) did not. Both paths
landed in the same commit (huggingface#5407) -- an oversight, not deliberate
asymmetry.
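For illustration, a minimal sketch of the ragged-to-rectangular reassembly
described above (invented values; not the actual getter):

```python
import torch

# Two completions of lengths 3 and 1, reassembled into a rectangular (B, T)
# tensor with the -inf sentinel at the tail of the shorter sample.
rows = [torch.tensor([-0.1, -0.9, -0.4]), torch.tensor([-0.2])]
out = torch.full((len(rows), max(r.numel() for r in rows)), float("-inf"))
for i, r in enumerate(rows):
    out[i, : r.numel()] = r
# out: [[-0.1, -0.9, -0.4],
#       [-0.2, -inf, -inf]]
```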
Unmasked, the -inf sentinel produces a teacher distribution [-inf, 0]
after ``_add_tail_bucket`` and +inf in ``_jsd_divergence``'s forward pass
(clamped to ``finfo.max`` by ``nan_to_num``), but NaN in the backward
pass: ``nan_to_num`` clamps only the forward value, while the chain rule
still multiplies through the pre-clamp +inf (0 * inf = NaN), so it
surfaces as a NaN gradient.
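A minimal standalone repro of that backward-pass leak (a hypothetical toy
term, not the trainer's exact divergence):

```python
import torch

student = torch.zeros(2, requires_grad=True)    # student logprobs
teacher = torch.tensor([0.0, float("-inf")])    # -inf = padding sentinel

# Reverse-KL-style term exp(s) * (s - t) hits +inf at the sentinel position.
term = student.exp() * (student - teacher)
loss = torch.nan_to_num(term, posinf=torch.finfo(term.dtype).max).sum()
assert loss.isfinite()   # forward is clamped

loss.backward()
print(student.grad)      # e.g. tensor([1., nan]): the pre-clamp +inf
                         # re-enters the chain rule as 0 * inf = NaN
```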
Fix
---
Mirror the forward-KL server path's masking: after the ``isfinite`` checks
that guard required positions, replace the -inf sentinel with a finite
zero at all known padding positions (``labels == -100``) via
``torch.where``. The label mask in ``_reduce_divergence_loss`` still
excludes those positions from the final loss; the new neutralisation
prevents their -inf values from propagating through ``_add_tail_bucket``
and ``_jsd_divergence`` into the autograd graph.
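A sketch of the pattern (illustrative variable names, not the exact trainer
code):

```python
import torch

teacher_logprobs = torch.tensor([[-1.2, -0.7, float("-inf")]])
labels = torch.tensor([[101, 102, -100]])     # -100 = ignored position

padding = labels == -100
teacher_logprobs = torch.where(padding, 0.0, teacher_logprobs)
# Padding positions now hold a finite dummy value; the label mask downstream
# still drops them from the loss, so they contribute zero gradient.
```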
Tests
-----
``tests/experimental/test_distillation_trainer.py`` is new (DistillationTrainer
had zero dedicated tests at v1.1.0):
- Sentinel contract at the server-path getter.
- The reverse-KL mask pattern produces finite forward AND backward on a
ragged batch (sketched after this section).
- End-to-end training step under ``per_device_train_batch_size=1``,
``gradient_accumulation_steps=2``, variable completion lengths, with a
mocked ``VLLMClient``. Covers ``beta=1.0`` (reverse KL) and ``beta=0.5``
(JSD).
Reproduction pre-fix: ``grad_norm=nan`` on step 1. Post-fix:
``grad_norm`` is finite and padding positions receive zero gradient
(correctly excluded from the learning signal).
A parallel audit of GKDTrainer confirmed it is not vulnerable to the same
class of bug: its teacher runs in-process on a dense rectangular batch,
with no HTTP ragged-to-rectangular reassembly and no -inf sentinel in the
GKD loss path.
Refs: huggingface#5407.
Pull request overview
Fixes a NaN-gradient issue in the experimental distillation trainer’s server-backed reverse-KL / generalized JSD loss when batches contain variable completion lengths, by neutralizing -inf padding sentinels before divergence math runs.
Changes:
- Add masking in `_compute_server_sparse_top_1_divergence_loss` to replace teacher `-inf` sentinels at `labels == -100` positions with finite zeros.
- Clarify the `-inf` sentinel contract and where it is neutralized downstream.
- Add a new regression test suite covering sentinel padding, finite forward/backward behavior, and an end-to-end `train()` run with ragged completion lengths using a mocked `VLLMClient`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `trl/experimental/distillation/distillation_trainer.py` | Neutralizes `-inf` sentinels at ignored label positions for the server reverse-KL/JSD path to prevent NaN gradients. |
| `tests/experimental/test_distillation_trainer.py` | Adds unit + functional regression tests validating the sentinel contract and guarding against non-finite backward passes under variable completion lengths. |
Collapse the module summary, triple-line test docstrings, and the one-shot helper factories in `tests/experimental/test_distillation_trainer.py` into the repo's terse style. Functional coverage (sentinel pin, mid-level mask finite forward/backward, end-to-end train() under bs*ga>1 with ragged batches for beta=1.0 and beta=0.5) is unchanged; all 4 tests still pass.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Experiments showed the end-to-end regression tests were miscalibrated:
- `bs=1, ga=2` and `bs=2, ga=1` both reproduce `grad_norm=nan` when the fix is removed (because `_get_teacher_token_logprobs_from_server` emits -inf padding not only for cross-sample ragged batches but also via per-sample `completion_offsets`). Parametrize the reverse-KL test over both configs for fuller coverage.
- `beta=0.5` (JSD mixture) does not actually produce NaN without the fix in either config: `_jsd_divergence`'s mixture branch routes student gradients through `log((1-beta)*student_probs + beta*teacher_probs)`, which stays finite when `teacher_probs=0` at padding (see the sketch below). Drop the JSD end-to-end test -- it was a vacuous guard.

Unit + mid-level tests (sentinel contract, mask-keeps-forward-and-backward-finite) are unchanged.
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
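The `beta=0.5` claim above can be checked numerically; a toy version of the
mixture branch (made-up probabilities, `teacher_probs=0` modelling a
neutralised padding position; not the trainer's exact JSD formula):

```python
import torch

beta = 0.5
student_probs = torch.tensor([0.7, 0.3], requires_grad=True)
teacher_probs = torch.tensor([0.0, 0.0])   # zero mass at the padded position

mixture = (1 - beta) * student_probs + beta * teacher_probs
loss = -(student_probs * mixture.log()).sum()
loss.backward()
assert torch.isfinite(student_probs.grad).all()   # mixture > 0 wherever
                                                  # student_probs > 0
```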
- Trim padding-mask comment to two lines focused on what it prevents; the backward-autograd exposition lived in the PR description.
- Drop the explicit `zero` scalar tensor -- `torch.where` broadcasts the `0.0` literal to the tensor's dtype/device (verified bit-exact equivalent in fp32/bf16/fp16; sketched below).
- Mark the end-to-end `trainer.train()` test `@pytest.mark.slow` to match repo convention for heavy tests (saves ~8s per warm CI run).
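The scalar-broadcast equivalence from the second bullet, in isolation (toy
tensors):

```python
import torch

t = torch.tensor([1.0, float("-inf")], dtype=torch.bfloat16)
mask = torch.tensor([False, True])

explicit = torch.where(mask, torch.zeros((), dtype=t.dtype), t)
literal = torch.where(mask, 0.0, t)
assert torch.equal(explicit, literal)   # bit-exact, as noted above
```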
What does this PR do?
Fixes a NaN-gradient bug in `DistillationTrainer`'s server-backed reverse-KL / generalized JSD loss when batches contain per-sample completion lengths that differ.

**Trigger:** `use_teacher_server=True` + `beta > 0` + `per_device_train_batch_size * gradient_accumulation_steps > 1` with variable completion lengths. Forward loss is finite (clamped by `nan_to_num`); `grad_norm=nan` on the first optim step.

**Root cause:** `_get_teacher_token_logprobs_from_server` pads rectangular teacher logprobs with `-inf`. The forward-KL server path (`_compute_server_forward_kl_loss`) masks the sentinel before the divergence math via `valid = teacher > -inf` + `torch.where` + a support mask threaded through `_add_tail_bucket`. The reverse-KL path skips this masking. Unmasked `-inf` flows through `_add_tail_bucket` (producing `[-inf, 0]`) and `_jsd_divergence` (producing `+inf` in forward, clamped by `nan_to_num`, but NaN in backward -- autograd's chain rule does not respect `nan_to_num`). Both paths landed together in huggingface#5407; the asymmetric masking looks like an oversight.

**Fix:** In `_compute_server_sparse_top_1_divergence_loss`, after the existing `isfinite` validation, neutralise the sentinel at known padding positions (`labels == -100`) with a finite zero via `torch.where`, before the shared divergence helper runs. The label mask in `_reduce_divergence_loss` continues to exclude these positions from the final loss.

**Tests:** New `tests/experimental/test_distillation_trainer.py` (trainer had no dedicated tests): `_add_tail_bucket` + `_jsd_divergence` (`beta=1`) post-mask, finite forward & backward; `DistillationTrainer.train()` at `bs=1, ga=2` with variable-length dataset and mocked `VLLMClient` for `beta=1.0` and `beta=0.5`. `pytest tests/experimental/test_distillation_trainer.py -v`: 4/4 pass in 28.12s.

**Env** (`trl env`):

Before submitting
AI writing disclosure